{
 "cells": [
  {
   "cell_type": "markdown",
   "source": [
    "# Fine-tuning a Model on Your Own Data\n",
    "\n",
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial2_Finetune_a_model_on_your_data.ipynb)\n",
    "\n",
    "For many use cases it is sufficient to just use one of the existing public models that were trained on SQuAD or other public QA datasets (e.g. Natural Questions).\n",
    "However, if you have domain-specific questions, fine-tuning your model on custom examples will very likely boost your performance.\n",
    "While this varies by domain, we saw that ~ 2000 examples can easily increase performance by +5-20%.\n",
    "\n",
    "This tutorial shows you how to fine-tune a pretrained model on your own dataset."
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "### Prepare environment\n",
    "\n",
    "#### Colab: Enable the GPU runtime\n",
    "Make sure you enable the GPU runtime to experience decent speed in this tutorial.\n",
    "**Runtime -> Change Runtime type -> Hardware accelerator -> GPU**\n",
    "\n",
    "<img src=\"https://raw.githubusercontent.com/deepset-ai/haystack/master/docs/_src/img/colab_gpu_runtime.jpg\">"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Make sure you have a GPU running\n",
    "!nvidia-smi"
   ],
   "outputs": [],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "source": [
    "# Install the latest release of Haystack in your own environment \n",
    "#! pip install farm-haystack\n",
    "\n",
    "# Install the latest master of Haystack\n",
    "!pip install grpcio-tools==1.34.1\n",
    "!pip install git+https://github.com/deepset-ai/haystack.git\n",
    "\n",
    "# If you run this notebook on Google Colab, you might need to\n",
    "# restart the runtime after installing haystack."
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "from haystack.nodes import FARMReader"
   ],
   "outputs": [],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "\n",
    "## Create Training Data\n",
    "\n",
    "There are two ways to generate training data\n",
    "\n",
    "1. **Annotation**: You can use the [annotation tool](https://github.com/deepset-ai/haystack#labeling-tool) to label your data, i.e. highlighting answers to your questions in a document. The tool supports structuring your workflow with organizations, projects, and users. The labels can be exported in SQuAD format that is compatible for training with Haystack.\n",
    "\n",
    "![Snapshot of the annotation tool](https://raw.githubusercontent.com/deepset-ai/haystack/master/docs/_src/img/annotation_tool.png)\n",
    "\n",
    "2. **Feedback**: For production systems, you can collect training data from direct user feedback via Haystack's [REST API interface](https://github.com/deepset-ai/haystack#rest-api). This includes a customizable user feedback API for providing feedback on the answer returned by the API. The API provides a feedback export endpoint to obtain the feedback data for fine-tuning your model further.\n",
    "\n",
    "\n",
    "## Fine-tune your model\n",
    "\n",
    "Once you have collected training data, you can fine-tune your base models.\n",
    "We initialize a reader as a base model and fine-tune it on our own custom dataset (should be in SQuAD-like format).\n",
    "We recommend using a base model that was trained on SQuAD or a similar QA dataset before to benefit from Transfer Learning effects.\n",
    "\n",
    "**Recommendation**: Run training on a GPU.\n",
    "If you are using Colab: Enable this in the menu \"Runtime\" > \"Change Runtime type\" > Select \"GPU\" in dropdown.\n",
    "Then change the `use_gpu` arguments below to `True`"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "source": [
    "reader = FARMReader(model_name_or_path=\"distilbert-base-uncased-distilled-squad\", use_gpu=True)\n",
    "data_dir = \"data/squad20\"\n",
    "# data_dir = \"PATH/TO_YOUR/TRAIN_DATA\" \n",
    "reader.train(data_dir=data_dir, train_filename=\"dev-v2.0.json\", use_gpu=True, n_epochs=1, save_dir=\"my_model\")"
   ],
   "outputs": [
    {
     "output_type": "stream",
     "name": "stderr",
     "text": [
      "04/28/2020 14:39:27 - INFO - farm.utils -   device: cpu n_gpu: 0, distributed training: False, automatic mixed precision training: None\n",
      "04/28/2020 14:39:27 - INFO - farm.infer -   Could not find `distilbert-base-uncased-distilled-squad` locally. Try to download from model hub ...\n",
      "04/28/2020 14:39:29 - WARNING - farm.modeling.language_model -   Could not automatically detect from language model name what language it is. \n",
      "\t We guess it's an *ENGLISH* model ... \n",
      "\t If not: Init the language model by supplying the 'language' param.\n",
      "04/28/2020 14:39:31 - WARNING - farm.modeling.prediction_head -   Some unused parameters are passed to the QuestionAnsweringHead. Might not be a problem. Params: {\"loss_ignore_index\": -1}\n",
      "04/28/2020 14:39:33 - INFO - farm.utils -   device: cpu n_gpu: 0, distributed training: False, automatic mixed precision training: None\n",
      "04/28/2020 14:39:33 - INFO - farm.utils -   device: cpu n_gpu: 0, distributed training: False, automatic mixed precision training: None\n",
      "Preprocessing Dataset data/squad20/dev-v2.0.json: 100%|██████████| 1204/1204 [00:02<00:00, 402.13 Dicts/s]\n",
      "Train epoch 0/1 (Cur. train loss: 0.0000):   0%|          | 0/1213 [00:00<?, ?it/s]"
     ]
    }
   ],
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Saving the model happens automatically at the end of training into the `save_dir` you specified\n",
    "# However, you could also save a reader manually again via:\n",
    "reader.save(directory=\"my_model\")"
   ],
   "outputs": [],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# If you want to load it at a later point, just do:\n",
    "new_reader = FARMReader(model_name_or_path=\"my_model\")"
   ],
   "outputs": [],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## About us\n",
    "\n",
    "This [Haystack](https://github.com/deepset-ai/haystack/) notebook was made with love by [deepset](https://deepset.ai/) in Berlin, Germany\n",
    "\n",
    "We bring NLP to the industry via open source!  \n",
    "Our focus: Industry specific language models & large scale QA systems.  \n",
    "  \n",
    "Some of our other work: \n",
    "- [German BERT](https://deepset.ai/german-bert)\n",
    "- [GermanQuAD and GermanDPR](https://deepset.ai/germanquad)\n",
    "- [FARM](https://github.com/deepset-ai/FARM)\n",
    "\n",
    "Get in touch:\n",
    "[Twitter](https://twitter.com/deepset_ai) | [LinkedIn](https://www.linkedin.com/company/deepset-ai/) | [Slack](https://haystack.deepset.ai/community/join) | [GitHub Discussions](https://github.com/deepset-ai/haystack/discussions) | [Website](https://deepset.ai)\n",
    "\n",
    "By the way: [we're hiring!](https://www.deepset.ai/jobs)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}